1 Getting Started

1.1 Setup

Welcome! While we’re waiting:

  • Clone or download the workshop files from: https://github.com/dlab-geo/
    • If you downloaded the zipfile, unzip it.
    • Make a note of the folder in which the files reside.
  • Open RStudio

  • Open a new R script file

1.2 Introduction

  • About me

  • About you
    • Your familiarity with US Census data
    • with geospatial data
    • with geospatial data in R

1.3 Outline

  • Describe primary Census data products

  • Introduce R packages for working with Census Data

  • Use those packages to fetch census data

  • Use those packages to fetch census data plus census geograpic boundary files

  • Make maps of census data

2 Census Data Overview

2.1 US Census Data

The “nation’s leading provider of quality data about its people and economy.”

Available at www.census.gov

2.2 Primary Census Products

  • Decennial Census

  • American Community Survey (ACS)

2.3 Decennial Census

Complete count of the population every 10 years since 1790

Includes data on

  • population, by age & race/ethnicity

  • housing, by occupancy & tenure (owned, rented)

2.4 American Community Survey (ACS)

  • Annual survey of a sample of about 3 million household

  • Provides estimates of demographic, social, economic & housing characteristics

  • Includes margin of error values for the estimates.

2.5 Decennial Census* vs ACS Data

Demographic* Social Economic Housing
Sex Families Income Tenure*
Age Education Benefits Occupancy*
Race Marital Status Employment Status Structure Type
Hispanic Origin Fertility Occupation Housing Value
Grandparents Industry Taxes & Insurance
Veterans Commuting Utilities
Disability Status Place of Work Mortgage
Language at Home Health Insurance Monthly Rent
Citizenship
Mobility

2.6 Census Geographies

Census data are publicly available at one or more levels of geographic aggregation.

2.7 Census Data & Census Geographies

2.9 Census Data Workflow

Identify your

  • topic of interest
  • year(s)
  • geographic level of detail
  • for what locations?

Then determine what specific tables and variables are available - ACS or Decennial?

2.10 CAUTION

“If you want to measure change you can’t change the measures!”

Census tables, variables, geographies, and geographic boundaries change over time!

Measuring change over time with census data is it’s own thing, complex and not covered by this workshop!

3 R Packages

3.1 Packages for Working with Census Data

These are the ones we recommend and will use today.

4 tidycensus & tigris

4.1 tidycensus

Functions for accessing census decennial and ACS 5 year datasets via Census APIs

  • only a subset of datasets / years available
  • requires a Census API key

4.2 tidycensus

Limited set of years available via tidycensus

  • decennial census: 1990, 2000, and 2010
  • ACS 5 yr: 2005-2010 through 2012-2016 are available.
  • Note: tidycensus referes to ACS 5year datasets by the endyear.
  • 2013 - 2017 released Dec 6, 2018 by the Census.
    • Need to check its availability in tidycensus.

4.3 tigris

Provides access to census geographic data files

  • detailed TIGER/Line boundary files (e.g., shapefiles), or
  • simplified Cartographic boundary files

Also provides access to census feature data,

  • eg, rivers, roads, coastlands, landmarks, and more

Used by tidycensus to access state, county, tract, block group, block, and ZCTA boundaries.

  • Use tigris directly to access other census geographic data.

4.4 tidycensus & tigris

Packages developed by Kyle Walker to make it easier to fetch data from Census websites and APIs in R and get that data in a useable format to analyze, plot, and map.

Check out his website to keep abreast of his great packages, blog posts and tutorials.

Walker also develped a new DataCamp course Analyzing US Census Data in R!

  • Highly recommended! First chapter free!

4.5 Alternatives

You can write code to access the Census APIs directly.

You can download Census data directly from:

You can download Census geographic data directly on the census website

4.6 tidyverse

A collection of R Packages for data science - developed primarily by Hadley Wickham, Chief Scientist at RStudio.

  • dplyr and tidyr for reshaping data

  • ggplot2 for plotting

  • purr, readr and tibble for improved performance

These packages are used by tidyverse under the hood.

4.7 sf

Simple features for geospatial data objects and methods.

  • Next generation R package for working with vector geospatial data
    • will soon supercede the sp package

sf includes the functionality of the sp, rgdal, rgeos and proj4 packages.

  • but with improved performance, simplified command syntax and easier workflows.

5 Tutorial Time!

5.1 Part 1.

We will work through several exercises using tidycensus to fetch, wrangle and map census data.

5.2 Loading packages

Load the packages we will use today

library(tidycensus)
library(tidyverse) 
library(tigris)
library(sf)

5.3 Install any packages that you do not have on your computer

Also install any dependancies.

# install.packages("tidyverse")
# install.packages("tidycensus")
# install.packages("sf")

5.4 Census API Key

You need a census API key to programmatically fetch census data.

Get it here (pretty quick):

For more info see:

5.5 Install your Census API Key

You can use the tidycensus function census_api_key to use your key in the current R session.

If you set install=TRUE R will save your key in the ~/.Renviron file.

# Install your census api key - long alphanumeric string
#
census_api_key("YOUR_KEY_HERE", install=T)    # Install it - once
readRenviron("~/.Renviron")                   # Read it - once
Sys.getenv("CENSUS_API_KEY")                  # check it - as needed

5.6 Set working directory

to the folder in which you unzipped the workshop files, e.g.,

setwd("~/Documents/Dlab/workshops/2018/rCensus_workshop")

6 Fetching Data from the Decennial Census

6.1 Population Data

Let’s start by fetching population data for each state from the 2010 Census

6.2 Identify Tables & Variables

First step is to identify the names of the census variables that contain the data of interest.

The tidycensus load_variables function can help with that.

First, take a look at the function documentation.

?load_variables

6.3 load_variables

Use load_variables to fetch all variables used in the 2010 census into a dataframe.

vars2010 <- load_variables(year=2010,        # Year or end year for ACS
                           dataset = 'sf1',  # 'sf1' for decennial or 'acs5'
                           cache = TRUE)     # Whether to save fetched data locally

6.4 Decennial Census Variables

Let’s take a look at and discuss the resultant dataframe.

  • How many 2010 census variables are in the table?
View(vars2010)

6.5 2010 Decennial Census Tables

  • Topics: Population, housing

  • Tables: currenty 333 - that’s a lot!
    • 177 population tables (identified with a ‘‘P’’) available to the block level
    • 58 housing tables (identified with an ‘‘H’’) available to the block level
    • 82 population tables (identified with a ‘‘PCT’’) available to the census tract level
    • 4 housing tables (identified with an “HCT”) available to the census tract level
    • 10 population tables (identified with a “PCO”) available to the county level
    • plus 2 additoinal PCT tables

6.6 get_decennial

get_decenial is the tidycensus function for fetching data from the decennial census API.

First, check the documentation for the function.

?get_decennial

6.7 get_decennial

Let’s fetch total population by state (P001001) from the 2010 census using get_decennial.

pop2010 <- get_decennial(geography = "state",   # census tabulation unit
                         variables = "P001001", # variable(s) of interest
                         year = 2010)           # census year
## Getting data from the 2010 decennial Census

6.8 View the Data

  • How many rows and columns?

  • What column has the population counts?

pop2010
## # A tibble: 52 x 4
##    GEOID NAME                 variable     value
##    <chr> <chr>                <chr>        <dbl>
##  1 01    Alabama              P001001   4779736.
##  2 02    Alaska               P001001    710231.
##  3 04    Arizona              P001001   6392017.
##  4 05    Arkansas             P001001   2915918.
##  5 06    California           P001001  37253956.
##  6 08    Colorado             P001001   5029196.
##  7 09    Connecticut          P001001   3574097.
##  8 10    Delaware             P001001    897934.
##  9 11    District of Columbia P001001    601723.
## 10 12    Florida              P001001  18801310.
## # ... with 42 more rows

6.9 Visualize results

ggplot2 is the most commonly used R package for data visualization.

It is loaded when you load the tidyverse package.

  • Knowlege of ggplot very useful for both exploratory analysis and communicating results.
pop_plot<- ggplot(data=pop2010, aes(x=reorder(NAME,value), y=value/1000000)) + 
  geom_bar(stat="identity") + coord_flip() +
  theme_minimal() + 
  labs(title = "2010 US Population by State") +
  xlab("State") +
  ylab("in millions")

6.10 Display the plot

6.11 Challenge

Fetch population data by state for 2000.

Don’t assume variable names are the same across years. Check first!

6.12 Total Population in 2000

# What is the variable name in 2000?
vars2000 <- load_variables(year=2000, dataset = 'sf1', cache = T)

# Take a look and search in the dataframe
View(vars2000)

# Fetch the 2000 pop data
pop2000 <- get_decennial(geography = "state", variables = "P001001", year = 2000)

# Take a look
pop2000

6.13 Limiting by Area of Interest

In the previous example we retrieved population data for all states.

  • This is the default behavior if you don’t specify a subset.

  • You can limit the retrieved data by state or by county, if applicable.

6.14 Limit Areas of Interest

Let’s fetch data for just 3 states.

state_pop2010 <- get_decennial(geography = "state", # census tabulation unit
                         variables = "P001001",     # variables of interest
                         year = 2010,               # census year
                         state=c("CA","OR","WA"))   # Filter by states of interest
## Getting data from the 2010 decennial Census

6.15 Review Results

state_pop2010
## # A tibble: 3 x 4
##   GEOID NAME       variable     value
##   <chr> <chr>      <chr>        <dbl>
## 1 06    California P001001  37253956.
## 2 41    Oregon     P001001   3831074.
## 3 53    Washington P001001   6724540.

6.16 Changing Census Tabulation unit

get_decennial accepts a number of different values for tabulation unit.

  • Options include: state, county, tract, block group, block, and ZCTA.

Let’s change the tabulation unit to county.

co_pop2010 <- get_decennial(geography = "county",   # census tabulation unit
                            variables = "P001001",  # variables of interest
                            year = 2010)
## Getting data from the 2010 decennial Census

6.17 Changing Census Tabulation unit

View the Data to see what was retrieved.

str(co_pop2010)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3221 obs. of  4 variables:
##  $ GEOID   : chr  "01001" "01007" "01017" "01019" ...
##  $ NAME    : chr  "Autauga County, Alabama" "Bibb County, Alabama" "Chambers County, Alabama" "Cherokee County, Alabama" ...
##  $ variable: chr  "P001001" "P001001" "P001001" "P001001" ...
##  $ value   : num  54571 22915 34215 25989 25833 ...
head(co_pop2010, 2)
## # A tibble: 2 x 4
##   GEOID NAME                    variable  value
##   <chr> <chr>                   <chr>     <dbl>
## 1 01001 Autauga County, Alabama P001001  54571.
## 2 01007 Bibb County, Alabama    P001001  22915.
tail(co_pop2010, 2)
## # A tibble: 2 x 4
##   GEOID NAME                           variable  value
##   <chr> <chr>                          <chr>     <dbl>
## 1 72151 Yabucoa Municipio, Puerto Rico P001001  37941.
## 2 72153 Yauco Municipio, Puerto Rico   P001001  42043.

6.18 Challenge

  1. Fetch population by county for just California

  2. Fetch population by county for Oregon & California

6.19 Challenge

  1. Fetch population by tract for all states.

  2. Fetch population by tract for California.

6.20 Fetching Census Tract Data

If you want census data at the tract level or below you need to specifiy the state & county or counties.

tract_pop2010 <- get_decennial(geography = "tract",   # census tabulation unit
                         variables = "P001001",       # variable of interest
                         year = 2010,                 # census year
                         state="CA",                  # limit to state of California
                         county=c("Alameda","Contra Costa"))  # and only these counties
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census

6.21 Fetching Census Tract Data

View the results! How many census tracts are in these 3 counties?

tract_pop2010
## # A tibble: 569 x 4
##    GEOID       NAME                                         variable value
##    <chr>       <chr>                                        <chr>    <dbl>
##  1 06001400100 Census Tract 4001, Alameda County, Californ… P001001  2937.
##  2 06001400200 Census Tract 4002, Alameda County, Californ… P001001  1974.
##  3 06001400300 Census Tract 4003, Alameda County, Californ… P001001  4865.
##  4 06001400400 Census Tract 4004, Alameda County, Californ… P001001  3703.
##  5 06001400500 Census Tract 4005, Alameda County, Californ… P001001  3517.
##  6 06001400600 Census Tract 4006, Alameda County, Californ… P001001  1571.
##  7 06001400700 Census Tract 4007, Alameda County, Californ… P001001  4206.
##  8 06001400800 Census Tract 4008, Alameda County, Californ… P001001  3594.
##  9 06001400900 Census Tract 4009, Alameda County, Californ… P001001  2302.
## 10 06001401000 Census Tract 4010, Alameda County, Californ… P001001  5678.
## # ... with 559 more rows

6.22 Challenge

  1. Fetch population by county for Alameda County, California

  2. Fetch population by tract for the nine county Bay Area:
  • Alameda, SF, Contra Costa, Marin County, Napa,
  • San Mateo, Santa Clara, Solano, Sonoma, Santa Cruz

6.23 Challenge hint

You can use names or FIPs codes as your state and county function arguments.

# County FIPs Codes for
# Alameda, SF, Contra Costa, Marin County, Napa, 
# San Mateo, Santa Clara,  Solano,  Sonoma, santa cruz
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")

# Your code below...
bayarea_pop10 <-...

6.24 Fetching data for more than one census variable

What three things are new here?

ur_pop10 <- get_decennial(geography = "county",  # census tabulation unit
                           variables = c(urban="P002002",rural="P002005"),
                           year = 2010, 
                           summary_var = "P002001",  # The denominator
                           state='CA',
                           county=c("Napa","Sonoma","Mendocino"))
## Getting data from the 2010 decennial Census

6.25 Fetching data for more than one census variable

What three things are new here?

  1. You can specify more than one variable:

    variables = c("P002002","P002005")
  2. You can name the output columns.

    variables = c(urban="P002002",rural="P002005")
  3. You can identify a summary_var. This value is the denominator - the total count of all people or households surveyed. The values in this column will be used as a demoninator for other calcuations like percent of total.

    summary_var = "P002001"

6.26 Take a look at the results

ur_pop10
## # A tibble: 6 x 5
##   GEOID NAME                         variable   value summary_value
##   <chr> <chr>                        <chr>      <dbl>         <dbl>
## 1 06045 Mendocino County, California urban     48110.        87841.
## 2 06055 Napa County, California      urban    118194.       136484.
## 3 06097 Sonoma County, California    urban    424102.       483878.
## 4 06045 Mendocino County, California rural     39731.        87841.
## 5 06055 Napa County, California      rural     18290.       136484.
## 6 06097 Sonoma County, California    rural     59776.       483878.

6.27 Calculating Percents

The summary_value column comes in handy when you want to compute percent of total.

Here’s one way to do it.

ur_pop10 <- ur_pop10 %>%
            mutate(pct = 100 * (value / summary_value))
## Warning: package 'bindrcpp' was built under R version 3.4.4

6.28 Calculating Percents

Let’s take a look at the output

ur_pop10 # Take a look
## # A tibble: 6 x 6
##   GEOID NAME                         variable   value summary_value   pct
##   <chr> <chr>                        <chr>      <dbl>         <dbl> <dbl>
## 1 06045 Mendocino County, California urban     48110.        87841.  54.8
## 2 06055 Napa County, California      urban    118194.       136484.  86.6
## 3 06097 Sonoma County, California    urban    424102.       483878.  87.6
## 4 06045 Mendocino County, California rural     39731.        87841.  45.2
## 5 06055 Napa County, California      rural     18290.       136484.  13.4
## 6 06097 Sonoma County, California    rural     59776.       483878.  12.4

6.29 Plot it

Plots give us compact visual summaries of the data

myplot <- ggplot(data = ur_pop10, 
          mapping = aes(x = NAME, fill = variable, 
                     y = ifelse(test = variable == "urban", 
                                yes = -pct, no = pct))) +
          geom_bar(stat = "identity") +
          scale_y_continuous(labels = abs, limits=c(-100,100)) +
          labs(title="Urban & Rural Population in Wine Country", 
               x="County", y = " Percent of Population", fill="") +
          coord_flip()

6.30 Plot it

myplot

6.31 Fetch all the data in one table

This is often helpful but you need to keep tract of the meaning of each variable.

alco_pop10 <- get_decennial(geography = "tract", # Census tabulation unit
                           table =  "P002",      # Table of urban & rural population counts
                           year = 2010,          # Decennial census year
                           state='CA',           # Filter state
                           county="Alameda")     # Filter county
## Getting data from the 2010 decennial Census

6.32 Take a look

unique(alco_pop10$variable)
## [1] "P002001" "P002002" "P002003" "P002004" "P002005" "P002006"
alco_pop10
## # A tibble: 2,166 x 4
##    GEOID       NAME                                         variable value
##    <chr>       <chr>                                        <chr>    <dbl>
##  1 06001400100 Census Tract 4001, Alameda County, Californ… P002001  2937.
##  2 06001400200 Census Tract 4002, Alameda County, Californ… P002001  1974.
##  3 06001400300 Census Tract 4003, Alameda County, Californ… P002001  4865.
##  4 06001400400 Census Tract 4004, Alameda County, Californ… P002001  3703.
##  5 06001400500 Census Tract 4005, Alameda County, Californ… P002001  3517.
##  6 06001400600 Census Tract 4006, Alameda County, Californ… P002001  1571.
##  7 06001400700 Census Tract 4007, Alameda County, Californ… P002001  4206.
##  8 06001400800 Census Tract 4008, Alameda County, Californ… P002001  3594.
##  9 06001400900 Census Tract 4009, Alameda County, Californ… P002001  2302.
## 10 06001401000 Census Tract 4010, Alameda County, Californ… P002001  5678.
## # ... with 2,156 more rows

6.33 Fetching Data for multiple years

You cannot do this with one tidycensus function call unless you wrap it in a function.

  • See Extra section at the end of tutorial.
pop2000 <- get_decennial(geography = "state", variables = "P001001", year = 2000)
pop2010 <- get_decennial(geography = "state", variables = "P001001", year = 2010)

6.34 Output options

Let’s try all three of these commands and then look at the ouput to see what’s different?

pop2010a <- get_decennial(geography = "state", variables = "P001001", year = 2010)
pop2010b <- get_decennial(geography = "state", variables = c(pop10="P001001"), year = 2010)
pop2010c <- get_decennial(geography = "state", variables = c(pop00="P001001"), year = 2010, output="wide")

6.35 Data Wrangling

Your R skills can help you reformat the data and make it more useable.

First, fetch population data for 2010 & 2000 by state with output=wide.

Label the variables pop00 and pop10.

6.36 Data Wrangling

pop2000 <- get_decennial(geography = "state", variables = c(pop00="P001001"), year = 2000, output="wide")
## Getting data from the 2000 decennial Census
pop2010 <- get_decennial(geography = "state", variables = c(pop10="P001001"), year = 2010, output="wide")
## Getting data from the 2010 decennial Census

6.37 Merge population by state from both censuses

pop2000_2010 <- pop2000 %>% merge(pop2010, by="NAME") %>%
                             select(NAME, pop00, pop10)

6.38 Data Wrangling

View results

pop2000_2010
##                    NAME    pop00    pop10
## 1               Alabama  4447100  4779736
## 2                Alaska   626932   710231
## 3               Arizona  5130632  6392017
## 4              Arkansas  2673400  2915918
## 5            California 33871648 37253956
## 6              Colorado  4301261  5029196
## 7           Connecticut  3405565  3574097
## 8              Delaware   783600   897934
## 9  District of Columbia   572059   601723
## 10              Florida 15982378 18801310
## 11              Georgia  8186453  9687653
## 12               Hawaii  1211537  1360301
## 13                Idaho  1293953  1567582
## 14             Illinois 12419293 12830632
## 15              Indiana  6080485  6483802
## 16                 Iowa  2926324  3046355
## 17               Kansas  2688418  2853118
## 18             Kentucky  4041769  4339367
## 19            Louisiana  4468976  4533372
## 20                Maine  1274923  1328361
## 21             Maryland  5296486  5773552
## 22        Massachusetts  6349097  6547629
## 23             Michigan  9938444  9883640
## 24            Minnesota  4919479  5303925
## 25          Mississippi  2844658  2967297
## 26             Missouri  5595211  5988927
## 27              Montana   902195   989415
## 28             Nebraska  1711263  1826341
## 29               Nevada  1998257  2700551
## 30        New Hampshire  1235786  1316470
## 31           New Jersey  8414350  8791894
## 32           New Mexico  1819046  2059179
## 33             New York 18976457 19378102
## 34       North Carolina  8049313  9535483
## 35         North Dakota   642200   672591
## 36                 Ohio 11353140 11536504
## 37             Oklahoma  3450654  3751351
## 38               Oregon  3421399  3831074
## 39         Pennsylvania 12281054 12702379
## 40         Rhode Island  1048319  1052567
## 41       South Carolina  4012012  4625364
## 42         South Dakota   754844   814180
## 43            Tennessee  5689283  6346105
## 44                Texas 20851820 25145561
## 45                 Utah  2233169  2763885
## 46              Vermont   608827   625741
## 47             Virginia  7078515  8001024
## 48           Washington  5894121  6724540
## 49        West Virginia  1808344  1852994
## 50            Wisconsin  5363675  5686986
## 51              Wyoming   493782   563626

6.39 Quick Combo Plot

To check my data values

ggplot() +
  geom_bar(data=pop2000_2010, aes(x=reorder(NAME,pop10), y=pop10/1000000), stat="identity") + coord_flip() +
  geom_point(data=pop2000_2010, aes(x=reorder(NAME,pop10), y=pop00/1000000, col="red")) + coord_flip()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

6.40 Save the data

Use write.csv to save a data frame to a CSV file.

write.csv(pop2000_2010, file="pop2000_2010.csv", row.names = FALSE)

7 QUESTIONS?

8 Part 2. Mapping

8.1 Mapping Census Data with tidycensus

We can fetch geographic data with tidycensus by adding geometry=TRUE to our function calls.

Under the hood, tidycensus calls the tigris package to fetch data from the Census Geographic Data APIs.

Only a subset of data available via tigris can be accessed via tidycensus.

8.2 Geometry Options

Before fetching geometry, we need to specify a few tigris options

  • Set the class of returned data to be sf objects (not sp, the default)
  • Set tigris_use_cache to TRUE
  • Set cache directory - Set to a folder on your computer
# Tigris options - used by tidycensus
options(tigris_class = "sf")      # SP is the default format returned by tigris
options(tigris_use_cache = TRUE)  # Save retrieved data locally
tigris_cache_dir("~/Documents/gis_data/census")  # Folder for local data
## Your new tigris cache directory is ~/Documents/gis_data/census. 
## To use now, restart R or run `readRenviron('~/.Renviron')`
Sys.getenv('TIGRIS_CACHE_DIR') # Check it
## [1] "~/Documents/gis_data/census"

8.3 Fetch geographic boundary data with tidycensus

We fetch the geospatial data by setting geometry=TRUE.

pop2010geo <- get_decennial(geography = "state", 
                          variables = c(pop10="P001001"), 
                          year = 2010, 
                          output="wide", 
                          geometry=TRUE) # Fetch geometry with the data for mapping
## Getting data from the 2010 decennial Census

8.4 Take a look

Let’s take a minute to discuss the format of an sf spatial object.

pop2010geo
## Simple feature collection with 52 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -179.1473 ymin: 17.88481 xmax: 179.7785 ymax: 71.35256
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 52 x 4
##    GEOID NAME                     pop10                           geometry
##    <chr> <chr>                    <dbl>                 <MULTIPOLYGON [°]>
##  1 01    Alabama               4779736. (((-85.00237 31.00068, -85.02411 …
##  2 02    Alaska                 710231. (((-164.9762 54.13459, -164.9378 …
##  3 04    Arizona               6392017. (((-109.0452 36.99908, -109.0452 …
##  4 05    Arkansas              2915918. (((-94.55929 36.4995, -94.51948 3…
##  5 06    California           37253956. (((-122.4463 37.86105, -122.4386 …
##  6 08    Colorado              5029196. (((-102.0422 36.99308, -102.0545 …
##  7 09    Connecticut           3574097. (((-71.85957 41.3224, -71.86823 4…
##  8 10    Delaware               897934. (((-75.55945 39.62981, -75.5591 3…
##  9 11    District of Columbia   601723. (((-77.0386 38.79151, -77.0389 38…
## 10 12    Florida              18801310. (((-85.15641 29.67963, -85.1374 2…
## # ... with 42 more rows

8.5 Take a look at the file on our local computer

Because we set tigris_use_cache = TRUE, the geographic data are saved locally as an ESRI Shapefile, the most common geospatial data file format. We can use this file for other geospatial mapping and analysis tasks since we have it stored locally. Let’s check it out.

First, check to see what the directory is.

my_cache_dir <- Sys.getenv('TIGRIS_CACHE_DIR') 

dir(my_cache_dir) # What files stored there?

8.6 Geospatial Data in R

R sf objects include

  • geometry of type (MULTI) POINT, LINE, POLYGON
    • in the dataframe column named geometry
  • a dataframe of attributes

  • a CRS (coordinate reference system), specified by
    • epsg(SRID) code
    • proj4string

8.7 Census Data CRS

All census geographic data use the NAD83 CRS, or coordinate reference system.

NAD83 stands for North American Datum of 1983. The geographic coordinates are longitude and latitude values encoded as decimal degrees.

WGS84, or The World Geodetic System of 1984 is the most commonly used geographic CRS. The difference between points encoded in these two systems can vary, on average, up to 1 meter in the continental US.

Many geospatial operations require you transform data to a common CRS before conducting spatial analysis or mapping.

As an in depth discussion of CRSs is outside the scope of this workshop, see Geocomputation in R for more information.

8.8 Mapping sf Spatial Objects

We can use plot to make a quick map the geometry stored in an sf spatial object.

plot(pop2010geo$geometry)

8.9 Question

What do you get if you plot the sf object without specifying “$geometry”

8.10 The Challenge of US maps

Almost all of the USA is between -66 and -180 degrees latitude.

But Alaska includes a few islands that are between 180-172 degrees latitude.

This makes it difficult to map the US centered at 0,0 lat/lon.

8.11 Fetch geographic data with tidycensus, SHIFTED

tidycensus includes a shift_geo parameter to shift AK & HI to below Texas.

pop2010geo_shifted <- get_decennial(geography = "state", 
                                    variables = c(pop10="P001001"), 
                                    output="wide",
                                    year = 2010, 
                                    geometry=TRUE, 
                                    shift_geo=TRUE)
## Getting data from the 2010 decennial Census
## Using feature geometry obtained from the albersusa package
## Please note: Alaska and Hawaii are being shifted and are not to scale.

8.12 Shift Happens!

plot(pop2010geo_shifted$geometry)

8.13 Save it

You can save sf data to a shapefile using st_write

st_write(pop2010geo_shifted,"usa_2010_shifted.shp")

8.14 Check your TIGRIS_CACHE_DIR to see it

my_cache_dir <- Sys.getenv('TIGRIS_CACHE_DIR') 

dir(my_cache_dir) # What files stored there?

8.15 Mapping Data Values

plot(pop2010geo_shifted['pop10'])

8.16 ggplot2 Maps

ggplot(pop2010geo_shifted, aes(fill = pop10)) + 
  geom_sf()

8.17 Fancier

ggplot(pop2010geo_shifted, aes(fill = pop10/1000000, color = pop10/1000000)) + 
  geom_sf() +
  scale_fill_viridis_c() + scale_color_viridis_c(guide = FALSE) +
  theme_minimal() +
  labs(title = "Population by State, 2010", fill="Population (millions)")

8.18 Challenge

Create a map of CA Population in 2010 by county

8.19 2010 pop Data for California Counties

cal_pop10 <- get_decennial(geography = "county", 
                           variables = "P001001",
                           year = 2010, 
                           state='CA',
                           geometry=TRUE)
## Getting data from the 2010 decennial Census

8.20 Map it

plot(cal_pop10['value'])

8.21 Fetch County data for more than one state

We can fetch both the census data and geometry for more than one state.

west_pop10 <- get_decennial(geography = "county", 
                           variables =  "P001001",
                           year = 2010, 
                           state=c('CA','OR','NV',"AZ"),
                           geometry=T)
## Getting data from the 2010 decennial Census

8.22 Census Tract Data

Fetching the data for all tracts in one state.

Need to specify county!.

# Optional state & county argments to get more limited geo subset
alco_pop10 <- get_decennial(geography = "tract", 
                           variables = "P001001", 
                           year = 2010, 
                           state='CA',
                           county='Alameda',
                           geometry=T)
## Getting data from the 2010 decennial Census

8.23 Challenge

Fetch and map the 2010 population by census tract for Alameda and Countra Costa counties.

8.24 Mapping Tract data

alcc_pop10 <- get_decennial(geography = "tract", 
                      variables = "P001001", 
                      year = 2010, 
                      state='CA',
                      county=c("Alameda","Contra Costa"),
                      geometry=T) 
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
plot(alcc_pop10['value'])

8.25 Challenge

Fetch and map the percent of San Francicso properties by census tract that were coded as rented in the 2010 Census.

To start, indentify the variables for the

  • total number of hounsing units

  • number of renter occupied units

8.26 SF Rented Units, 2010

sf_rented <- get_decennial(geography = "tract",  # census tabulation unit
                           variables =  "H004004",
                           year = 2010, 
                           summary_var = "H004001",  # Total Urban - the denominator
                           state='CA',
                           county='San Francisco',
                           geometry=T)

sf_pct_rented <- sf_rented[sf_rented$value > 0,] %>%
                 mutate(pct = 100 * (value / summary_value))

plot(sf_pct_rented['pct'])

9 Questions?

10 Part 3. ACS 5 year data

10.1 ACS 5 year

You can use the tidycensus get_acs function to retrieve data for the ACS 5 year products, beginning with the 2005 - 2010 dataset.

The default end year in the current version of tidycensus (as of Dec 4, 2018) is 2016 for the 2012-2016 ACS 5 year dataset.

get_acs syntax and options are very similar to get_decennial.

But there are many more data variables!

10.2 Fetch List of ACS 5 year Variables

Let’s start by fetching ACS 5-year 2016 data on poverty.

We want to explore the number of folks living below the poverty level by census tract.

First we need to find the variable name(s)!

10.3 Load ACS Table Vars

Load the ACS 2012-2016 5 year data variables into a dataframe.

  • ACS 5 year datasets are referenced by end year in tidycensus!

Then search for the variable names.

How many variables refer to the concept of poverty?

acs2016vars <- load_variables(year=2016, dataset = 'acs5', cache = T)
#View(acs2016vars)

10.4 ACS Tables and variables

Many hundreds (thousands?) more than for decennial census!

See the documentation on the census website

Types of tables:

  • B prefix = base tables
  • C = collapsed tables
  • DP = data profiles
  • S = Subject tables

10.5 Census Reporter

ACS variables can be confusing.

The Census Reporter website (https://censusreporter.org) provides another tool for navigating topics, tables, and variable names.

Let’s check it out to see what tables/variables we should use.

10.6 get_acs

Use the tidycensus get_acs function to fetch the poverty data for census tracts in San Francisco

?get_acs

10.7 get_acs in action

sf_poor <- get_acs(geography = "tract",  
                   table = "C17002",     # Table: ratio income to poverty
                   year = 2016,          
                   state="CA",
                   summary_var = "C17002_001", # Est of num people - denom
                   county="San Francisco",
                   geometry=T)               
## Getting data from the 2012-2016 5-year ACS

10.8 View output

Let’s take a look at the output of get_acs and discuss how it differs from get_decennial.

 sf_poor
## Simple feature collection with 1576 features and 7 fields (with 8 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.0139 ymin: 37.69274 xmax: -122.3276 ymax: 37.86334
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID                                               NAME
## 1  06075010100 Census Tract 101, San Francisco County, California
## 2  06075010100 Census Tract 101, San Francisco County, California
## 3  06075010100 Census Tract 101, San Francisco County, California
## 4  06075010100 Census Tract 101, San Francisco County, California
## 5  06075010100 Census Tract 101, San Francisco County, California
## 6  06075010100 Census Tract 101, San Francisco County, California
## 7  06075010100 Census Tract 101, San Francisco County, California
## 8  06075010100 Census Tract 101, San Francisco County, California
## 9  06075010200 Census Tract 102, San Francisco County, California
## 10 06075010200 Census Tract 102, San Francisco County, California
##      variable estimate moe summary_est summary_moe
## 1  C17002_001     3972 291        3972         291
## 2  C17002_002      314 196        3972         291
## 3  C17002_003      397 185        3972         291
## 4  C17002_004      120 130        3972         291
## 5  C17002_005      236 129        3972         291
## 6  C17002_006      123 112        3972         291
## 7  C17002_007      444 304        3972         291
## 8  C17002_008     2338 387        3972         291
## 9  C17002_001     4300 442        4300         442
## 10 C17002_002      160 123        4300         442
##                          geometry
## 1  MULTIPOLYGON (((-122.4206 3...
## 2  MULTIPOLYGON (((-122.4206 3...
## 3  MULTIPOLYGON (((-122.4206 3...
## 4  MULTIPOLYGON (((-122.4206 3...
## 5  MULTIPOLYGON (((-122.4206 3...
## 6  MULTIPOLYGON (((-122.4206 3...
## 7  MULTIPOLYGON (((-122.4206 3...
## 8  MULTIPOLYGON (((-122.4206 3...
## 9  MULTIPOLYGON (((-122.4267 3...
## 10 MULTIPOLYGON (((-122.4267 3...

10.9 map it

First remove census tracts that have no people!

sf_poor2 <- subset(sf_poor, summary_est > 0)
plot(sf_poor2['estimate'])

10.10 Calculating percents

Let’s calculate the percent below poverty by tract.

sf_poor3 <- sf_poor2 %>%
  mutate(pct = 100 * (estimate / summary_est))

sf_poor3
## Simple feature collection with 1560 features and 8 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -122.5145 ymin: 37.70813 xmax: -122.3276 ymax: 37.86334
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID                                               NAME
## 1  06075010100 Census Tract 101, San Francisco County, California
## 2  06075010100 Census Tract 101, San Francisco County, California
## 3  06075010100 Census Tract 101, San Francisco County, California
## 4  06075010100 Census Tract 101, San Francisco County, California
## 5  06075010100 Census Tract 101, San Francisco County, California
## 6  06075010100 Census Tract 101, San Francisco County, California
## 7  06075010100 Census Tract 101, San Francisco County, California
## 8  06075010100 Census Tract 101, San Francisco County, California
## 9  06075010200 Census Tract 102, San Francisco County, California
## 10 06075010200 Census Tract 102, San Francisco County, California
##      variable estimate moe summary_est summary_moe        pct
## 1  C17002_001     3972 291        3972         291 100.000000
## 2  C17002_002      314 196        3972         291   7.905337
## 3  C17002_003      397 185        3972         291   9.994965
## 4  C17002_004      120 130        3972         291   3.021148
## 5  C17002_005      236 129        3972         291   5.941591
## 6  C17002_006      123 112        3972         291   3.096677
## 7  C17002_007      444 304        3972         291  11.178248
## 8  C17002_008     2338 387        3972         291  58.862034
## 9  C17002_001     4300 442        4300         442 100.000000
## 10 C17002_002      160 123        4300         442   3.720930
##                          geometry
## 1  MULTIPOLYGON (((-122.4206 3...
## 2  MULTIPOLYGON (((-122.4206 3...
## 3  MULTIPOLYGON (((-122.4206 3...
## 4  MULTIPOLYGON (((-122.4206 3...
## 5  MULTIPOLYGON (((-122.4206 3...
## 6  MULTIPOLYGON (((-122.4206 3...
## 7  MULTIPOLYGON (((-122.4206 3...
## 8  MULTIPOLYGON (((-122.4206 3...
## 9  MULTIPOLYGON (((-122.4267 3...
## 10 MULTIPOLYGON (((-122.4267 3...

10.11 subset to isolate vars of interest

We want to subset on and then combine the percents of people in each tract who are below 50% and 99% of the poverty line.

sf_poor4 <- subset(sf_poor3, (variable %in% c("C17002_002","C17002_003")))
head(sf_poor4)
## Simple feature collection with 6 features and 8 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -122.4267 ymin: 37.79847 xmax: -122.3996 ymax: 37.81144
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##          GEOID                                               NAME
## 2  06075010100 Census Tract 101, San Francisco County, California
## 3  06075010100 Census Tract 101, San Francisco County, California
## 10 06075010200 Census Tract 102, San Francisco County, California
## 11 06075010200 Census Tract 102, San Francisco County, California
## 18 06075010300 Census Tract 103, San Francisco County, California
## 19 06075010300 Census Tract 103, San Francisco County, California
##      variable estimate moe summary_est summary_moe      pct
## 2  C17002_002      314 196        3972         291 7.905337
## 3  C17002_003      397 185        3972         291 9.994965
## 10 C17002_002      160 123        4300         442 3.720930
## 11 C17002_003       75  55        4300         442 1.744186
## 18 C17002_002      194 102        4341         426 4.469016
## 19 C17002_003      431 204        4341         426 9.928588
##                          geometry
## 2  MULTIPOLYGON (((-122.4206 3...
## 3  MULTIPOLYGON (((-122.4206 3...
## 10 MULTIPOLYGON (((-122.4267 3...
## 11 MULTIPOLYGON (((-122.4267 3...
## 18 MULTIPOLYGON (((-122.4187 3...
## 19 MULTIPOLYGON (((-122.4187 3...

10.12 Group by and sum

sf_poor5 <- sf_poor4 %>%
  select(GEOID, pct, geometry) %>%
  group_by(GEOID) %>% 
  summarise(pct_below_pov = sum(pct))

head(sf_poor5)
## Simple feature collection with 6 features and 2 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -122.4267 ymin: 37.79323 xmax: -122.3897 ymax: 37.81144
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 6 x 3
##   GEOID       pct_below_pov                                       geometry
##   <chr>               <dbl>                                  <POLYGON [°]>
## 1 06075010100         17.9  ((-122.4206 37.81111, -122.4075 37.81144, -12…
## 2 06075010200          5.47 ((-122.4267 37.80964, -122.4249 37.8108, -122…
## 3 06075010300         14.4  ((-122.4187 37.80593, -122.4169 37.80521, -12…
## 4 06075010400          7.93 ((-122.4149 37.80354, -122.4116 37.80396, -12…
## 5 06075010500          8.53 ((-122.4052 37.80476, -122.4035 37.80509, -12…
## 6 06075010600         24.7  ((-122.411 37.80117, -122.4078 37.80157, -122…

10.13 Map it

Where are SF’s poorest areas?

plot(sf_poor5['pct_below_pov'])

11 Interactive Maps with tmap

11.1 tmap

The tmap package is great for making both static and interactive maps. It turns R into a GIS.

Let’s check it out with our last dataframe.

11.2 tmap

library(tmap)
## Warning: package 'tmap' was built under R version 3.4.4
tmap_mode("view") # set mode to interactive
## tmap mode set to interactive viewing
poverty_map <- tm_shape(sf_poor5) +
                  tm_polygons(col="pct_below_pov")

11.3 tmap

View the map - click on tracts

poverty_map

11.4 tmap

There are a number of great tutorials online for working with tmap.

See the References at the end of this workshop document.

11.5 Types of Census Geographic Data

Cartographic Boundary data vs Detailed TIGER/Line data

By default, tidycensus downloads census cartographic boundary data.

These are simplifed geometries, clipped to coastlines.

In get_acs you can also request the more detailed census TIGER data.

Let’s check it out.

11.6 Fetch Cartographic Boundary Data

sf_poor_cb <- get_acs(geography = "tract",   
                   table = "C17002", 
                   summary_var = "C17002_001",
                   year = 2016,           
                   state="CA",
                   county="San Francisco",
                   geometry=TRUE,
                   cb = TRUE)     # THIS IS THE DEFAULT!
## Getting data from the 2012-2016 5-year ACS

11.7 Fetch Detailed TIGER/Line Geometry

sf_poor_tl <- get_acs(geography = "tract",   
                   table = "C17002",        
                   summary_var = "C17002_001",
                   year = 2016,              
                   state="CA",
                   county="San Francisco",
                   geometry=TRUE,
                   cb = FALSE)  # Fetching the TIGER/Line data  
## Getting data from the 2012-2016 5-year ACS

11.8 Map it

par(mfrow=c(1,2))
plot(sf_poor_cb[sf_poor$summary_est>0,]$geometry)
plot(sf_poor_tl[sf_poor$summary_est>0,]$geometry)

par(mfrow=c(1,1))

11.9 Visualize differences with Tmap

zoom in to explore, especially around the coastline.

tm_shape(sf_poor_tl) +
    tm_borders() +
tm_shape(sf_poor_cb) +
    tm_borders(col="red")

11.10 Challenge

The ACS 2013-2017 5 year dataset was released Dec 6, 2018. Although the current version of tidycensus states that 2012-2016 is the latest ACS 5-year product, see if you can fetch & map the percent of people below poverty line in San Francisco using the 2013-2017 ACS 5-year data.

12 Questions?

12.1 Scaling Up Example

In this example we show you how you can read in census variables of interest from a file into an R dataframe. You can then use that dataframe to fetch data for all those variables using tidycensus.

# Load cenvar lookup table of vars of interest
my_cenvar_df <-read.csv("data/cenvar_lookup.csv", strip.white = T, stringsAsFactors = F)

my_cenvar_df
##              my_cen_var_names my_cen_vars
## 1          citizenship_totpop B05001_001E
## 2     citizenship_non_citizen B05001_006E
## 3                entry_totpop B05005_001E
## 4                  entry_2010 B05005_002E
## 5             entry_2000_2009 B05005_007E
## 6           birthplace_totpop B05007_001E
## 7            birthplace_europ B05007_014E
## 8            birthplace_asian B05007_027E
## 9     birthplace_latinAmerica B05007_040E
## 10    birthplace_southAmerica B05007_081E
## 11    birthplace_other_nonUSA B05007_094E
## 12    birthplace_byage_totpop B06001_001E
## 13     birthplace_byage_fborn B06001_049E
## 14             poverty_totpop B06012_001E
## 15                  below_pov B06012_002E
## 16                 below_pov2 B06012_003E
## 17       poverty_fborn_totpop B06012_017E
## 18            below_pov_fborn B06012_018E
## 19           below_pov2_fborn B06012_019E
## 20       health_native_totpop B27020_002E
## 21  health_native_noinsurance B27020_006E
## 22    health_fborn_nat_totpop B27020_008E
## 23 fborn_nohealth_naturalized B27020_012E
## 24 health_fborn_noncit_totpop B27020_013E
## 25  fborn_nohealth_noncitizen B27020_017E

12.2 Fetch the ACS data

Fetch the ACS data for these variables for the 9 county bay area

nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")
bay9_data <-get_acs(geography = "tract", 
                       variables = my_cenvar_df$my_cen_vars, 
                       year=2016,
                       state = "CA", 
                       county = nine_counties, 
                       geometry = T)
## Getting data from the 2012-2016 5-year ACS
bay9_data
## Simple feature collection with 39700 features and 5 fields (with 150 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.5335 ymin: 36.89303 xmax: -121.2082 ymax: 38.86424
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID                                          NAME   variable
## 1  06001400100 Census Tract 4001, Alameda County, California B05001_001
## 2  06001400100 Census Tract 4001, Alameda County, California B05001_006
## 3  06001400100 Census Tract 4001, Alameda County, California B05005_001
## 4  06001400100 Census Tract 4001, Alameda County, California B05005_002
## 5  06001400100 Census Tract 4001, Alameda County, California B05005_007
## 6  06001400100 Census Tract 4001, Alameda County, California B05007_001
## 7  06001400100 Census Tract 4001, Alameda County, California B05007_014
## 8  06001400100 Census Tract 4001, Alameda County, California B05007_027
## 9  06001400100 Census Tract 4001, Alameda County, California B05007_040
## 10 06001400100 Census Tract 4001, Alameda County, California B05007_081
##    estimate moe                       geometry
## 1      3018 195 MULTIPOLYGON (((-122.2469 3...
## 2       218 106 MULTIPOLYGON (((-122.2469 3...
## 3       944 168 MULTIPOLYGON (((-122.2469 3...
## 4        61  50 MULTIPOLYGON (((-122.2469 3...
## 5       176  99 MULTIPOLYGON (((-122.2469 3...
## 6       843 156 MULTIPOLYGON (((-122.2469 3...
## 7       208  75 MULTIPOLYGON (((-122.2469 3...
## 8       546 141 MULTIPOLYGON (((-122.2469 3...
## 9        46  43 MULTIPOLYGON (((-122.2469 3...
## 10       11  17 MULTIPOLYGON (((-122.2469 3...

12.3 Reformat Ouput

  1. We only want to keep the estimate column for each variable of interest, plus the GEOID and geometry columns.

  2. We then want to make the data wide using the spread function. This will put each estimate variable is in its own column.

bay9_data2 <- bay9_data %>%
  select("GEOID", "variable", "estimate") %>%
  spread(key=variable, value=estimate)

12.4 Take a look

bay9_data2
## Simple feature collection with 1588 features and 26 fields (with 6 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.5335 ymin: 36.89303 xmax: -121.2082 ymax: 38.86424
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID B05001_001 B05001_006 B05005_001 B05005_002 B05005_007
## 1  06001400100       3018        218        944         61        176
## 2  06001400200       1960         92        315         33         55
## 3  06001400300       5236        317        900        130        314
## 4  06001400400       4171        190        524         69        115
## 5  06001400500       3748        430        701        226        123
## 6  06001400600       1661         62        205         21          9
## 7  06001400700       4552        353        619        122        177
## 8  06001400800       3506        276        767        108        185
## 9  06001400900       2262        202        446         20         16
## 10 06001401000       6193        477        754         90        186
##    B05007_001 B05007_014 B05007_027 B05007_040 B05007_081 B05007_094
## 1         843        208        546         46         11         43
## 2         243         65        136         28         19         14
## 3         857        119         90        257         86        391
## 4         471        146        228         35         15         62
## 5         635        160         83        245         37        147
## 6         178         49         74         38          0         17
## 7         587         70        145        299         67         73
## 8         741        181        330        149          8         81
## 9         405         20        136        148         38        101
## 10        671         56         46        519         67         50
##    B06001_001 B06001_049 B06012_001 B06012_002 B06012_003 B06012_017
## 1        3018        843       3011        113         20        843
## 2        1960        243       1952        106         84        243
## 3        5236        857       5153        450        217        848
## 4        4171        471       4158        268        198        471
## 5        3748        635       3733        339        240        635
## 6        1661        178       1656        158        229        178
## 7        4552        587       4552        820        723        587
## 8        3506        741       3457        381        348        741
## 9        2262        405       2228        358        268        405
## 10       6193        671       6184       1466        768        671
##    B06012_018 B06012_019 B27020_002 B27020_006 B27020_008 B27020_012
## 1          31         12       2175         88        625         12
## 2          31         22       1717         38        151          0
## 3         124        183       4379        111        540         34
## 4          52         82       3694        152        281          0
## 5         151         58       3113        243        205          6
## 6          59         21       1483        143        116          8
## 7         107        103       3965        512        234         33
## 8          75        191       2759        184        465         99
## 9          66         39       1857        103        203         15
## 10        120         17       5513        463        194          0
##    B27020_013 B27020_017                       geometry
## 1         218         10 MULTIPOLYGON (((-122.2469 3...
## 2          92         35 MULTIPOLYGON (((-122.2574 3...
## 3         317         15 MULTIPOLYGON (((-122.2642 3...
## 4         190         21 MULTIPOLYGON (((-122.2618 3...
## 5         430        148 MULTIPOLYGON (((-122.2694 3...
## 6          62          0 MULTIPOLYGON (((-122.2681 3...
## 7         353        147 MULTIPOLYGON (((-122.2779 3...
## 8         276         79 MULTIPOLYGON (((-122.2887 3...
## 9         202         28 MULTIPOLYGON (((-122.2856 3...
## 10        477        111 MULTIPOLYGON (((-122.2787 3...

12.5 Rename the columns

Use the dataframe of census variables to rename the columns so that they are self-describing.

colnames(bay9_data2)<-c("GEOID", my_cenvar_df$my_cen_var_names, "geometry")

12.6 Take a look

bay9_data2
## Simple feature collection with 1588 features and 26 fields (with 6 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.5335 ymin: 36.89303 xmax: -121.2082 ymax: 38.86424
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID citizenship_totpop citizenship_non_citizen entry_totpop
## 1  06001400100               3018                     218          944
## 2  06001400200               1960                      92          315
## 3  06001400300               5236                     317          900
## 4  06001400400               4171                     190          524
## 5  06001400500               3748                     430          701
## 6  06001400600               1661                      62          205
## 7  06001400700               4552                     353          619
## 8  06001400800               3506                     276          767
## 9  06001400900               2262                     202          446
## 10 06001401000               6193                     477          754
##    entry_2010 entry_2000_2009 birthplace_totpop birthplace_europ
## 1          61             176               843              208
## 2          33              55               243               65
## 3         130             314               857              119
## 4          69             115               471              146
## 5         226             123               635              160
## 6          21               9               178               49
## 7         122             177               587               70
## 8         108             185               741              181
## 9          20              16               405               20
## 10         90             186               671               56
##    birthplace_asian birthplace_latinAmerica birthplace_southAmerica
## 1               546                      46                      11
## 2               136                      28                      19
## 3                90                     257                      86
## 4               228                      35                      15
## 5                83                     245                      37
## 6                74                      38                       0
## 7               145                     299                      67
## 8               330                     149                       8
## 9               136                     148                      38
## 10               46                     519                      67
##    birthplace_other_nonUSA birthplace_byage_totpop birthplace_byage_fborn
## 1                       43                    3018                    843
## 2                       14                    1960                    243
## 3                      391                    5236                    857
## 4                       62                    4171                    471
## 5                      147                    3748                    635
## 6                       17                    1661                    178
## 7                       73                    4552                    587
## 8                       81                    3506                    741
## 9                      101                    2262                    405
## 10                      50                    6193                    671
##    poverty_totpop below_pov below_pov2 poverty_fborn_totpop
## 1            3011       113         20                  843
## 2            1952       106         84                  243
## 3            5153       450        217                  848
## 4            4158       268        198                  471
## 5            3733       339        240                  635
## 6            1656       158        229                  178
## 7            4552       820        723                  587
## 8            3457       381        348                  741
## 9            2228       358        268                  405
## 10           6184      1466        768                  671
##    below_pov_fborn below_pov2_fborn health_native_totpop
## 1               31               12                 2175
## 2               31               22                 1717
## 3              124              183                 4379
## 4               52               82                 3694
## 5              151               58                 3113
## 6               59               21                 1483
## 7              107              103                 3965
## 8               75              191                 2759
## 9               66               39                 1857
## 10             120               17                 5513
##    health_native_noinsurance health_fborn_nat_totpop
## 1                         88                     625
## 2                         38                     151
## 3                        111                     540
## 4                        152                     281
## 5                        243                     205
## 6                        143                     116
## 7                        512                     234
## 8                        184                     465
## 9                        103                     203
## 10                       463                     194
##    fborn_nohealth_naturalized health_fborn_noncit_totpop
## 1                          12                        218
## 2                           0                         92
## 3                          34                        317
## 4                           0                        190
## 5                           6                        430
## 6                           8                         62
## 7                          33                        353
## 8                          99                        276
## 9                          15                        202
## 10                          0                        477
##    fborn_nohealth_noncitizen                       geometry
## 1                         10 MULTIPOLYGON (((-122.2469 3...
## 2                         35 MULTIPOLYGON (((-122.2574 3...
## 3                         15 MULTIPOLYGON (((-122.2642 3...
## 4                         21 MULTIPOLYGON (((-122.2618 3...
## 5                        148 MULTIPOLYGON (((-122.2694 3...
## 6                          0 MULTIPOLYGON (((-122.2681 3...
## 7                        147 MULTIPOLYGON (((-122.2779 3...
## 8                         79 MULTIPOLYGON (((-122.2887 3...
## 9                         28 MULTIPOLYGON (((-122.2856 3...
## 10                       111 MULTIPOLYGON (((-122.2787 3...

13 Questions?

14 Extras for Enthusiasts

14.1 Fetching data for multiple years

This requires variable name to be the same across years!

# use purr::map_df to get data for multiple years (must have same vars!)
pop90_10 <- map_df(c(1990, 2000, 2010), function(x) { 
  get_decennial(geography = "state",
  variables = c(totalpop = "P001001"),
  dataset = "sf1",
  year = x) %>%
  mutate(year = x) }
)

# View output
head(pop90_10)
tail(pop90_10)

# Plot it
pop90_10 %>% ggplot(aes(x=reorder(NAME,value), y=value/1000000, fill=factor(year))) + 
             geom_bar(stat="identity", position=position_dodge()) + coord_flip()

15 Combining Census Data with Other Data

15.1 Area Weighted Interpolation

One of the strenghts of the sf package is how relatively easy it is to reaggregate data from one geometry to another. This process is called areal interpolation.

Area weighted interpolation reaggregates the data based on the percent of area shared by input and output geometeries.

15.2 Read in a Shapefile

sfnhoods<- st_read("data/sfnhoods.shp")
head(sfnhoods)
plot(sfnhoods['nhood'])

15.3 Check the CRS

st_crs(sfnhoods)
st_crs(sf_poor5)

15.4 CRS transformation

sf_poor5_4326 = st_transform(sf_poor5, st_crs(sfnhoods))

15.5 Area Weighted Interpolation

Reaggregate percent of people below poverty from census tract to neighborhood polygons.

sfhoods2 = st_interpolate_aw(sf_poor5_4326[, "pct_below_pov"], sfnhoods,
extensive = F) # True= aw sum; False= aw avg

15.6 Map it

par(mfrow=c(1,2))
plot(sf_poor5['pct_below_pov'])
plot(sfhoods2['pct_below_pov'])
par(mfrow=c(1,1))

15.7 Map it with tmap

tm_shape(sfhoods2) +
   tm_polygons(col="pct_below_pov")

15.8 Combine the values

head(sfhoods2)
sfnhoods$pct_below_pov <- sfhoods2$pct_below_pov

# map again - click on polygons and view data in popups
# to confirm the AWI output values
tm_shape(sfnhoods) +
  tm_polygons(col="pct_below_pov", 
    popup.vars = c("nhood", "pct_below_pov")
  )